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ABSTBACT 


Computation  bound  problems  inpose  a  severe  burden  on 
the  CFO.  In  order  to  speed  up  computation,  specific  problems 
that  are  identified  as  the  main  burden  can  be  done  using 
parallel  processing.  In  this  way,  the  time  consuming  tasks 
can  be  executed  on  specially  tailored  hardware.  This  hard¬ 
ware  is  designed  to  implement  an  algorithm-oriented 
parallel-processing  structure  that  works  more  efficiently 
than  the  CPU  for  these  specific  tasks.  This  thesis  is  a 
study  of  the  mapping  cf  the  algorithms  onto  a  kind  cf  struc¬ 
ture  called ^rrstolic  array.  The  development  and  utilization 
of  a  software  tool  designed  to  assist  on  such  analysis  is 
presented  here.  This  tool,  named  Systolic  Array  Graphics 
Simulator  (SYSGHAS) ,  has  the  capability  to  represent  any 
type  of  systolic  array,  no  matter  how  complex  the  cells  and 
structure  are.  Because  of  the  capability  of  SYSGEAS,  an 
interactive  computer  program  simulator,  the  study  of 
systolic  arrays  is  simplified.  The  complexity  of  the  time- 
space  relationships  is  analyzed  with  the  help  of  seme  color 
graphics  technigues.  The  visualization  of  the  data  interac¬ 
tion  is  thus  enhanced  and  the  user  is  alleviated  from  the 
burden  of  keeping  track  of  partial  results  and  can  dedicate 
attention  to  the  processing  algorithm.  ' 
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1.  A  FIRST  II SIGHT 

"A  systolic  system  is  a  network  of  processors  which  rkyth- 
aically  compute  and  pass  data  through  the  system. 
Physiologists  use  the  word  "systole"  to  refer  to  the 
rhythmically  recurrent  contraction  of  the  heart  and 
arteries  which  pulses  blood  through  the  body.  In  a 
systolic  computing  system ,  the  function  of  a  processor  is 
analogous  to  that  of  the  heart.  Every  processor  regularly 
pumps  data  in  and  out,  each  time  performing  some  short 
computation,  so  that  a  regular  flow  of  data  is  kept  in  the 
network. " 

"...matrix  computations  can  be  pipelined  elegantly  and 
efficiently  on  systolic  networks  having  an  array 
structure." 

"...special  purpose  hardware  devices  based  on  systolic 
arrays  can  be  built  inexpensively  using  the  VLSI 
technology." 

These  are  parts  of  the  abstract  of  the  paper  where 
Professors  H.T.  Rung  and  C.E.  Leiserson  have  first 
presented  the  concept  of  systolic  arrays  [fief.  1].  They 
fully  describe  the  concept  and  context  to  which  the  term 
"systolic  array"  applies. 

Several  authors  have  presented  papers  on  systolic  arrays 
in  recent  years,  but  this  subject  is  very  new  and  everything 
dates  tack  to  1978. 

B.  SYSTOLIC  ARRAYS:  TB1  BASIC  PRINCIPLE 

A  systolic  system  consists  of  a  set  of  interconnected 
cells,  each  capable  of  performing  some  simple  operation. 


Information  flovs  between  cells  in  a  pipelined  fashion,  and 
coaa unication  with  the  outside  world  occurs  only  at 
"boundary  cells". 

The  usual  computational  tasks  are  basically  divided  into 
two  categories — computing  bound  and  input/output  (I/O) 
bound.  If  the  tctal  number  of  computing  operations  is  much 
larger  than  that  of  I/O  operations,  then  we  have  a  computing 
bound  task.  Systolic  arrays  are  useful  to  speed  up  this 
type  cf  task.  The  array  is  able  to  use  each  input  element  a 
number  cf  times  thus  achieving  a  high  computational 
throughput  without  increasing  memory  bandwidth.  £Bef.  2] 

C.  Of  VISED  ABCHITECTOBES 

A  number  of  architectures  have  already  been  proposed  for 
solving  the  traditional  problems.  These  are  linear  arrays 
for  performing  matrix-vector  multiplication  and  finding 
solution  of  triangular  linear  systems;  two-dimensional 
arrays  for  matrix-matrix  multiplication/addition,  least 
squares  solution  and  10  factorization, etc  £Bef.  3]. 

D.  HABVABE  EXPLOB1TCBT  DEVEIOIHEHT 

Because  this  subject  is  new,  much  of  the  development  is 
still  in  an  exploratory  stage.  In  order  to  further  evaluate 
the  hardware  potential  of  systolic  arrays,  TES/ESI  has 
developed  a  processor  called  Ehoenix  that  has  already  been 
used  tc  provide  two-dimensional  FIH  filters  for  image 
processing  £Be£.  3:  page  3].  To  evaluate  systolic  algo¬ 
rithms  and  architectures,  the  Naval  Ocean  System  Center  has 
developed  a  reconf igurable  cne-two  dimensional  systolic 
array  with  6h-processing  elements  £Bef.  h ].  This  systolic 
testbed  processor  can  be  reconfigured  under  software  control 
to  perform  as  a  rectangular,  hexagonal,  or  linear  systolic 


array.  The  main  goal  of  its  design  was  to  achieve 


flexibility  in  algorithm  evaluation,  and  its  cells  are  atich 
slower  than  those  used  in  the  TBW/ESL  Phoenix  processor  or  a 
possible  implementation  in  a  VISI  chip. 


1.  1H1  107  TO  SXPLOBI  XT  BITH  SOFT VIBE? 

During  the  preliainary  investigations  in  this  field,  the 
approach  was  to  select  a  case  study  and  try  to  go  deeper  in 
the  process  of  understanding  the  way  a  particular  algorithm 
is  aapped  onto  a  systolic  architecture.  It  was  discovered 
that  it  is  not  always  straightforward  to  understand  the 
napping  process.  An  algoritha  like  the  one  proposed  by  H.T. 
Kung  and  V.n.  Gentleman  for  doing  matrix  triangularization 
can  be  really  difficult  to  understand  if  one  dees  not 
possess  an  adeguate  tcol  [Bef.  7:  page  22].  The  algebraic 
manipulation  can  be  difficult  when  dealing  with  systolic 
arrays  because  the  interaction  of  the  processors  tend  to 
lead  to  a  very  complicated  set  of  equations  in  only  a  few 
steps  and  it  becomes  difficult  to  achieve  any  generaliza¬ 
tion.  Seme  notations  have  been  proposed  to  help  in  the 
modeling  the  parallel  processing  £Ref.  5],  but  no  systetatic 
method  for  mapping  an  algorithm  onto  an  array  exists.  There 
is  a  lack  of  the  appropriate  tools  for  systolic  algorithm 
development.  However,  with  the  development  of  graphics 
technclogy  a  means  now  exists  that  can  be  used  to  help 
people  better  understand  how  a  systolic  array  works.  If  we 
conjugate  this  possibility  with  interactive  programming 
technigues,  it  may  be  possible  provide  an  user  friendly  tcol 
that  can  be  used  tc  achieve  the  goal  which  is  the  main 
guideline  of  this  investigation:  to  provide  <tn  easier  way 
to  understand  systolic  array  operation.  A  case  study  will 
be  presented  in  Chapter  III  that  turned  towards  performing 
Matrix  Triangularization  with  the  Givens  Rotations 
Algoritha.  Eventually,  the  systolic  processing  can  be 
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understood  and  checked  against  real  data  vith  the  use  of  our 
Systolic  Array  Graphics  Simulator  (SYSGBAS) .  This  approach, 
in  ccntiast  vith  that  folloved  by  researchers  at  the 
S.O.S.C.,  in  San  Diego,  is  to  take  full  advantage  of  the 
software,  using  coaputer  graphics  presentation  to  iaprcve  or 
to  add  another  diaension  to  the  aan/machine  interface. 

Anyway  we  want  tc  point  out  that  this  work  oust  be  seen 
as  an  investigation  cf  the  possibilities  of  using  this  kind 
of  facilities  as  tools  rather  than  as  a  final  product.  Only 
experiaentation  in  practical  work  can  show  the  advantages  or 
the  liaitations  of  this  approach.  As  in  any  other  area  of 
engineering,  feedback  froa  users  is  iaportant  in  the  evalua¬ 
tion  and  future  iaprcvenent  of  this  tool. 


iz.  IMS  MISUiXSB  A££iSA£3 


The  Systolic  Array  Graphical  Siaulator  (STSGRAS)  is 
implemented  to  allow  interactive  design  of  any  type  of 
systolic  array.  The  cells  and  their  interconnection  links 
are  defined  by  the  user  in  a  guided  aanner.  The  optional 
cell  types  are  built  into  the  S7SGBAS.  The  user  can 
construct  any  array  using  the  existing  cell  types  or  by 
introducing  new  cell  types,  but  the  introduction  of  a  new 
cell  type  cannot  be  done  interactively.  This  requires  a 
subroutine  to  describe  and  support  that  type,  but  the  opera¬ 
tion  is  simple  and  will  be  explained  later.  The  presently 
available  types  can  support  aost  of  the  algorithms  found  in 
the  literature. 

S75GBAS  is  designed  with  a  requirement  of  being  user 
friendly.  It  has  been  realized  that  the  aaount  of  data  that 
is  nciaally  required  to  be  entered  by  a  user  in  a  session  is 
quite  large.  It  is  important  to  offer  the  possibility  of 
correcting  errors,  reviewing  entered  data,  and  storing  data 
and  results  from  the  session.  It  can  also  be  used  later  in 
another  session.  The  sequence  of  operations  required  during 
a  session  is  shown  below: 

1)  Establish aent  cf  general  conditions: 

-  new  or  previous  problem  to  run. 

-  need  to  store  the  calculations  for  later 
printing. 

-  ability  tc  interrupt  or  review  the  progress  of 
systolic  processing  on  a  clocked  basis. 

2)  Definition  of  cells*  attributes: 

-  type  of  piccessor  in  the  cell. 

-  graphical  shape  of  the  cell. 

-  screen  coordinates  cf  the  cell. 
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-  interfaces  (importer  cell,  receiver  of  external 
data;  internal  cell,  only  exchanges  data  with 
other  cells;  exporter  cell,  data  transmitter 
to  external  vorld) . 

3)  Definition  of  links: 

-  define  the  origin  and  destination  of  each  link. 
The  origin  is  a  port  in  a  cell.  Each  cell  has  a 
number  of  output  and  input  ports.  The  function 
of  each  cne  depends  on  the  processor  type. 
The  destination  is  a  input  port  in  another 
cell.  A  link  is  defined  by  origin  cell 
output  port  and  the  destination  cell  input 
port. 

4)  Presentation  on  the  graphical  screen: 

-  the  pictcxe  of  the  user  defined  array  is 
initially  presented  on  the  screen  with  all 
processor  registers  initialized  to  zero.  Since 
the  presentation  is  strongly  based  on  colors, 
the  non  existence  of  data  interaction  in  the 
cells  aakes  them  all  turn  into  black.  The 
identification  number  given  by  the  user  to 
each  cell  is  placed  at  the  cell's  lower  right 
corner  and  red  arrows  indicate  the  linking 
between  cells  and  the  direction  of  data  flux 
(see  Fig.  4.7  at  Chapter  17). 

5)  Data  input  and  screen  update: 

-  as  soon  as  data  originated  from  the  external 
world  to  the  importer  cells  is  entered,  the 
whole  screen  is  updated  and  representing  the 
status  of  the  systolic  array  at  that  partic- 
ular  clock  cycle.  The  external  data  are  entered 
at  each  clock  cycle,  and  the  screen  is  updated 
accordingly. 


6)  Beviev  of  the  ccaplete  systolic  process: 

-  optionally,  the  user  can  review  the  coaplete 
process.  The  total  data  generated  in  a 
session  are  stored  automatically,  and  the 

graphics  presentation  of  any  previous  session 

can  be  seen  again  without  any  additional 

data  entry. 

The  central  idea  of  the  STSGRAS  is  to  allow  a  user  to 
construct  any  desired  systolic  array  and  to  use  as  the  input 
test  data  such  that  insight  can  be  gained  about  the  data 
interactions  of  the  systolic  algorithm.  The  user  has  the 
capability  to  correlate  the  propagation  of  the  input  data 
with  the  colors  of  the  cells  to  better  understand  the 
process.  Certainly,  the  greater  the  creativity  of  assigning 
the  color,  the  better  the  result  shoving  the  interactions. 

There  is  no  restriction  as  to  how  to  do  the  correlation  and 

several  exaaples  vill  be  presented.  The  colors  that  are 
available  to  a  input  datua  are  red,  green  and  blue.  If  a 
cell  receives  data  frcs  aore  than  one  source  with  different 
primary  colors,  these  colors  vill  combine.  The  cell  vill 
shou  on  the  screen  with  a  resulting  color  that  is  the  combi¬ 
nation  of  the  input  primary  colors.  This  resulting  color  is 
not  passed  on  to  other  cells.  The  propagation  of  data  is 
associated  with  the  primary  colors,  and  this  way  a  good 
tracking  of  the  timing  that  takes  for  each  data  wavefront  to 
propagate  through  the  array  can  be  achieved.  The  screen 
display  shows  the  array  in  a  way  selected  by  the  user.  The 
cells  are  presented  with  the  shape,  position,  identification 
and  connections  reguested  by  the  user.  The  colors  and  numer¬ 


ical  results  assuaed  in  each  clock  cycle  show  the  data 
interaction.  The  numerical  results  that  appear  on  the 
screen  is  in  a  fixed  format  and  with  rather  limited  preci¬ 
sion.  If  one  wants  greater  precision,  there  is  an  option  to 


record  tie  session  which  can  shoe  results  with  up  to  five 
decimals  on  printer  output.  These  results  can  he  checked 
against  the  known  values  to  verify  the  correct  operation  of 
the  sapped  systolic  algorithm. 
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A.  A  BE ASCI  FOB  HATBIX  TBIAHGD1AB1ZATIOH :  QB  DECOMPOSITIOH 

The  technique  of  £B  decomposition  is  useful  to  solve  the 
linear  least  Squares  Problem.  There  are  other  possible 
approaches  in  the  literature  to  solve  this  type  of  problem 
[Bef.  6].  We  will  select  the  QB  Decomposition  Method  as 
mentioned  by  Gentleman  and  Kung  in  [Bef.  7:  page  19].  let's 
present  a  brief  explanation  of  the  method. 

Given  a  linear  system  defined  by 

AX  =  E 

where  A  is  a  (n  x  p)  matrix,  X  is  a  (p  x  1}  and  B  is  a  (n  x 
1)  column  vector.  We  want  tc  find  the  vector  X  such  that 
J jr I | =| | E-AX| |  is  minimized.  The  vector  X  is  also  called 
vector  o£  regression  coefficients.  This  is  in  fact  a  vector 
of  estimated  elements  and  as  such  it  strictly  should  be 
written  X  but  for  convenience  X  will  continue  to  be  used. 
Also  a  mere  rigorous  matrix  notation  would  be,  for  example, 
A  instead  of  A.  If  we  are  able  to  find  an  orthogonal  matrix 
Q  such  that  QA=R  wheze  B  is  an  upper  triangular  matrix,  then 
we  will  have 


QAX  =  BX  =  QB 
and,  as  a  result,  BX  =  Q3. 

As  B,  Q,  and  E  are  all  known,  we  can  find  X  by  Back 
Substitution.  Gentleman  has  shown,  as  seen  in  £  Bef.  8:  page 
329],  that  this  approach  solves  the  Least  Squares  Problem 
because  it  solves  the  normal  equations  of  the  problem 
[Bef.  6:  page  148]. 


An  accurate  QR  Decomposition  can  be  obtained  in  several 
ways,  like  the  Gram-Schmidt  Process  or  Householder 
Trans for sations.  For  this  particular  stud;  we  vill  stick  to 
the  Givers  Rotations  Eethod.  This  method  reguires  the  use  of 
a  seguence  of  plane  cctations  on  matrices  A  and  B,  that  will 
convert  the  linear  system  to  a  representation  of  the  fcrm 

RX  =  CB 

which  is  straightforward  to  solve. 

The  Givens  Rotations  Method  has  two  different  implemen- 
tations  [Ref.  8:  page  331].  Re  vill  stud;  the  one  called 
"with  sguare  roots". 

E.  QS  DECOMPOSITION  El  GIVENS  ROTATIONS  WITH  SQUARE  ROOTS 

If  Q  is  an  orthogonal  matrix,  Y-QA  is  called  an  orthog¬ 
onal  transformation  and  the  columns  of  Q,  sa;  g{1),  g(2), 
...  are  orthogonal.  In  order  to  better  understand  what  an 
orthogonal  transformation  means,  we  can  interpret  the  g(i)'s 
as  unit  vectors  (because  of  their  unit  length)  that  repre¬ 
sent  the  new  reference  frame  in  the  original  coordinates 
frame.  If  we  apply  an  orthogonal  transformation  to  a  matrix 
(or  to  a  vector) ,  we  will  be  in  fact  changing  the  reference 
coordinates  for  that  vector.  The  information  content  of  the 
transformed  matrix  (cr  vector)  will  be  the  same,  although 
different  from  previous  form  because  of  the  new  reference. 
The  columns  of  Q  are  given  by  tfae  direction  cosines  of  the 

£££&!£&££  Jiili  S^s^ect  £2.  ££e  &14  reference  system 
[Ref.  12:  page  354]. 

An  orthogonal  transformation  can  be  divided  into  a 
seguence  of  elementary  transformations.  In  the  case  of 
Givens  Rctations,  each  one  of  these  elementary  transforma¬ 
tions  is  a  rotation  about  one  coordinate  axis  of  the  orig¬ 
inal  reference  frame  (which  nay  be  n-dimensional) . 


la  geometry  we  knew  that  the  transformation 


cos(al)  cos  (x2)  cos  (x3) 
cos(yl)  cos(y2)  cos(y3) 
cos  (2 1)  cos(z2)  cos  (z3) 


applied  to  a  vector  [al,  bl,  cl]  will  rotate  the  original 
reference  frame  abort  the  3  axes  at  the  same  time.  The 
direction  cosines  of  the  new  x-axis  are  cos(xl),  cos(x2)  and 
cos  (x2) ;  the  direction  cosines  of  the  new  y-axis  are 
cos(yl),  cos  (y2)  and  cos  (y3)  ,  and  so  on.  The  transformation 
effect  achieved  by  the  aatrix  of  cosines  is  similar  to  that 
obtained  by  the  Givens  Rotation  transformation  matrix  Q. 

Re  can  split  the  cosines  aatrix  into  a  product  of 
matrices  each  one  representing  a  rotation  about  one  axis  of 
the  original  frame.  This  technique  of  decomposing  the  oper¬ 
ation  into  product  operations,  each  one  being  responsible  to 
rotate  the  system  aatrix  of  an  angle  TET&  about  one  axis  of 
the  criginal  frame,  is  exactly  what  is  done  in  Givens 
algorithm.  Each  individual  Givens*  Rotation  is  represented 
by  a  matrix  of  the  fora 


D(j,i)  = 


1  0  0  0  .  0  0 


0  10  0 


0  ..  0  c  0 


0  ...  0  1  0 


0  1  0  ...  0 


=*  i  th  row 


0  ...— s . . . .  0  c  0 


==  j  th  row 


where  sgr(c)  +  sgr  (s)  *  1,  c  =  cos  (TETA)  ,  s  =  sin  (TETA)  and 
dots  represent  zeros  £Bef.  6:  page  153].  This  elementary 

rotation  sill  convert  the  (j,i)  element  of  the  matrix  shich 
it  premultiplies  into  zero,  that  is,  the  (j,i)  element  of 
the  product  of  matrix  D(j,i)  and  A  sill  be 

-s.a  (i,i)  ♦  c.a  (j,i)  =  0 

and  sc  it  follows  that 

d  =  sgrt(sgr  (a  (i,i))  ♦  sgr(a[j,i))) 

c  *  a(i,i)/d  and  s  *  a{j,i)/d 

If  the  matrix  A  is  (n  x  n)  ,  it  will  be  necessary  to  have 
(n-1)  elementary  rotations  to  turn  all  elements  of  the  first 
column  (  with  exception  of  a  (1,1)  )  onto  zero,  (n  -  2)  rota¬ 
tions  for  the  second  column,  and  so  on.  For  a  (3  x  3) 
matrix,  three  rotations  will  suffice. 

C.  HAPPING  GIVEIS  BC1ATI0HS  ALGORITHM  INTO  A  SYSTOLIC  ARBAI 

The  problem  of  mapping  an  algorithm  into  a  systolic 
structure  freguently  turns  out  to  be  the  subject  of  certain 
restrictions  because  the  computed  results  may  result  in  very 
small  numbers  impossible  to  be  represented  if  fixed  point 
hardware  is  used.  In  crder  to  avoid  establishing  undesirable 
conditions  on  the  data,  one  must  try  to  select  an  algorithm 
possible  to  be  mapped  and  to  perform  computations  cn  any 
data  within  the  selected  range  of  approximations. 

He  have  already  presented  the  Givens  Algorithm.  The 
problem  now  is  to  understand  how  a  systolic  array  can 
realize  it.  The  reader  must  be  aware  that  the  understanding 
of  the  mathematical  algorithm  does  not  lead  to  immediate 
insight  cn  how  the  array  works.  For  the  original  discussion 
on  this  algorithm  the  reader  is  referred  to  £Ref.  7]. 


Presented  in  the  following  are  the  specifications  of  the 
cell  processors  used  in  this  systolic  array,  the  general 
cell  arrangement,  and  the  input  data  streans.  However,  the 
sight  of  such  specification  does  not  reveal  the  intricacies 
of  the  deduction  of  which  function  should  be  performed  by 
each  particular  cell  element.  A  possible  way  to  gain 
insight  as  to  how  the  algoritha  is  sapped  onto  the  array  is 
to  perfora  an  algebraic  validation  that  is  extremelly  labo¬ 
rious.  For  a  start,  suppose  that  we  want  to  triangular ize  a 
(3  x  3)  matrix  A.  We  nust  preiultiply  it  3  times  by  elemen¬ 
tary  rotation  matrices  which  will  turn  the  elements  a  (2, 1)  , 
a (3,1)  and  a  (3,2)  into  zeros.  The  first  transformation  will 
be 


c 

s 

0  * 

a (1, 1) 

a  ( 1, 2) 

a  (1,  3)  • 

-s 

c 

0 

• 

a  (2, 1) 

a<2,2) 

a  (2,  3) 

0 

0 

1 

.  a  (3,1) 

a  (3,2) 

a  (3,3)  . 

This  elementary  rotation  will  turn  element  (2,1)  into  null 
on  the  product  matrix.  We  can  note  that  the  element  (-S)  at 
the  transformation  matrix  occupies  the  position  of  the 
element  that  will  be  turned  into  zero  in  the  transformed 
matrix.  The  transformed  matrix  will  become 

'  a*  (1,1)  a'  (1,2)  a'  (1,3)  ' 

0  a*  (2,2)  a'  (2,3) 

.  a  (3, 1)  a  (3,2)  a(3,3) 

which  will  be  operated  by  a  second  transformation  matrix 


’  c 

0 

s  ‘ 

’  a '  (1 , 1) 

a*  (1,2) 

a'  (1,3)  ' 

0 

1 

0 

• 

0 

a*  (2,2) 

a*  (2,3) 

,-s 

0 

c 

.  a  (3,1) 

a  (3,2) 

a  (3,3)  . 

and  that  will  generate  the  matrix 


a”  *1,1)  a”  (1 , 2)  a"  (1,3) 
0  a*  (2,2)  a*  (2,3) 

0  a’ (3,  2)  a*  (3  #3) 


Finally  the  last  transformation 


1  0  0 
0  c  s 
0  -s  c 


a”  (1  /  1)  a"  (1/  2)  a"  (1,3) 

0  a*  (2,2)  a*  (2,3) 

0  a*  (3,2)  a'  (3,3) 


will  generate  the  upper  triangular  matrix 


a"  (1.1)  a"  (1,2)  a"  (1,3) 

0  a"  (2,2)  a”  (2,3) 

0  0  a”  (3,3) 


It  is  important  to  note  that  the  s’s  and  c's  are 
different  in  each  transformation  matrix  and  are  calculated 
the  way  we  have  already  pointed  out,  that  is,  for  the  first 
elementary  transformation, 

d  =  sgrt  (sgr  (a  (1, 1) )  ♦  sgr(a(2,1))) 

c  =  a(1,1)/d  and  s  =  a  (2,  1)/d 

Now  the  evaluation  of  the  elements  of  each  transformed 
matrix  should  be  considered  by  performing  the  actual  multi¬ 
plications  to  correlate  the  resulting  algebra  with  the  algo¬ 
rithmic  processor’s  equations  presented  in  Figs.  3.1  and 
3.2.  For  this  particular  algorithm  it  requires  definitely  a 
lot  of  paperwork.  This  cannot  be  accepted  as  a  good  method 
if  one  wants  to  understand  the  general  data  flow  in  the 


systolic  array,  and  to  check  its  numeric  data.  This  is  why 
we  developed  the  STSGBAS  to  have  the  possibility  of  doing  an 
evaluaticn  study  on  how  several  algorithms  found  in  the 
literature  do  perform. 

D.  1  HUBEBICAL  EXAHPIE 

In  order  to  test  the  Givens  Algorithm  in  doing  Natriz 
Triangularization,  we  have  prepared  a  numeric  study  case.  A 
system  is  represented  by  the  matrix  equation 

2  4  11  x{1)  12  ' 

5  7  4  .  x  (2)  =  3 

,  3  0  1  X(3)  8. 

In  Appendix  C  we  shew  how  to  operate  on  this  eguaticn  for 
triangulating  and  to  solve  it  by  Back  Substitution.  Using 
these  numerical  data,  we  will  set  up  the  problem  fer  the 
simulator  and  be  able  to  check  out  the  simulation  results 
against  our  hand  calculated  values. 

E.  SETTIHG  UP  THE  PfiCBLEN  FOB  SIMULATION 
1.  Sketching  the  Graphics 

The  first  step  in  preparing  the  problem  for  simula¬ 
tion  consists  of  planning  the  graphics.  We  have  tc  consider 
the  following  points: 

-  How  do  we  want  our  systolic  array  arranged  on  the 
screen? 

-  What  kind  of  shape  will  be  used  for  the  cells  (in 
STSGBAS  we  have  two  options:  sguare  and  octogcnal)? 

The  screen  coordinates  for  the  cells  (SYSGRAS  divides 
the  screen  into  a  chessboard,  and  therefore  we  have  a 
span  of  8x8  positions  to  place  the  array  cells) . 


-  Which  are  the  input  and  output  ports?  This  is  depen¬ 
dent  on  the  subroutine  that  iapleaents  the  ceil 
processor ,  and  so  oust  be  coapatible  with  the  defini¬ 
tion  of  the  processor  cell  subroutine. 

-  which  links  are  used  to  connect  the  different  ports  of 
the  different  cells?  We  need  to  identify  everything  so 
that  the  correct  infornation  at  S7SG3AS  can  be 
entered. 

This  aay  be  better  understood  with  an  exaaple,  and 
we  will  continue  the  develofaent  of  the  systolic  array  to 
perfora  Hatrix  Triangularization. 

For  this  particular  case,  we  will  use  three  types  of 
processors:  a  Givens  External  cell,  a  Givens  Internal  cell 

(both  with  sguare  roots) ,  and  a  Buffer  cell.  The  last  one  is 
used  to  help  the  visualization  of  the  input  data  array  being 
puaped  into  the  systolic  array. 

Figures  3.1,  3.2,  and  3.3  show  which  ports  are 

considered  the  active  teroinais  in  the  actual  implenentation 


if  *  »  O  tn»n 
begi  n 
c  *  l.o 
s  *  0.0 
end 

else  begin  _ 

temp  *  »rr+  z’ 
C  a  r  /  temp 
s  *  «  /  temp 
r  a  temp 
end 


Legend:  r  *  residue 


1  OdNl 


Figure  3.4  Systolic  Array  for  Hatrix  Triangular ization 


in  STSGBAS  foe  these  cell  processors.  Fig.  3.4  shoe  how  to 
arrange  the  cells  and  their  interconnections.  It  is  impor- 
tant  to  realize  to  the  fact  that  although  circular  shaped 
cells  axe  presented  in  Figs.  3.1  and  3.4,  they  are  ixple- 
aented  in  the  STSSB2S  as  octogonal  shaped  cells.  This 
approach  was  adopted  to  improve  the  coaputational  speed  of 
the  simulator.  However,  it  has  been  decided  to  keep  the 
standard  circular  shape  in  Figs.  3.1  and  3.4  .  An  addi¬ 
tional  detail  for  the  reader  is  the  fact  that  cells  10  to  13 
in  Fig.  3.4,  buffer  cells,  do  not  show  exactly  the  sane  way 
as  Fig.  3.3  does.  This  is  because  a  buffer  cell  is  itple- 
nented  in  STSGBAS  with  three  channels,  but  for  Matrix 
Triangular ization  only  one  channel  is  used  (and  so  only  one 
output  port  is  shown  in  Fig.  3.4).  Attention  should  also 
be  paid  to  the  fact  that  the  arrows  shown  in  Fig.  3.4  repre¬ 
sent  the  links  that  have  actually  been  used  in  our  simula- 
tion.  Ceils  nunber  1,2  and  4  of  that  Figure  are  of  the  sane 
type  of  the  other  sguare  shaped  cells  that  appear  in  the 
sane  Figure,  although  there  is  no  arrow  drawn  on  their  right 
side.  The  reason  is  that  a  third  order  systen  is  used  in  the 
simulation  and  so  there  is  no  need  to  extend  the  limits  of 
the  array  beyond  those  cells. 

This  type  of  preparation  that  has  been  presented 
here  is  very  helpful  when  entering  the  data  because  there  is 
a  large  amount  of  information  asked  by  ST5GBAS  to  be  entered 
interactively.  If  every  detail  is  planned  beforehand,  the 
interaction  between  the  user  and  the  simulator  becomes 
faster  and  easier. 

2 .  Planning  the  Input  Data  Array 

Cnee  the  systolic  array  has  been  planned  and 
sketched,  the  user  must  prepare  the  test  data  that  exercise 
the  simulator.  This  is  pehaps  the  most  important  step  in 
the  simulation.  At  this  point  the  only  rule  to  follow  is 


that  a  tine  shift  as  required  by  the  algoritha  has  to  be 
respected.  The  way  cclor  inforaation  will  be  assigned  to  the 
entering  data  is  coapletely  up  to  the  user,  and  the  greater 
his/her  creativity,  the  better  the  siaulation  effect,  and 
consequently,  the  easier  to  interpret  the  results. 

Fig.  3.5  shows  a  possible  vay  to  prepare  the  data. 
Ve  can  identify  in  that  figure  the  nuabers  from  our  numeric 
example  referred  in  last  subsection.  The  tiae  shift  that  can 
he  seen  is  necessary  to  provide  the  necessary  synchronism 
for  the  systolic  processing.  In  some  algorithms,  the  output 
data  to  the  external  world  must  also  be  collected  in  a 
synchronous  fashion.  But,  for  the  particular  case  of  Matrix 
Triangularization,  as  soon  as  the  final  result  is  achieved, 
each  datum  is  frozen  in  its  cell  and  waits  for  a  pumping  out 
operatioc  whose  connection  links  are  not  being  considered  in 
our  simulation. 


If.  2S£  HALT  SIS  OP  THE  SIHOIATIOHS 


A.  EIYIDED  DI7FEBESCI5 

Before  going  into  detailed  analysis  of  the  Matrix 
Triangnlarization  soxe  simple  algorithms  are  presented. 

In  numerical  interpolation,  we  need  to  compute  the 
divided  differences  from  a  set  of  points  and  then  form  a 
polynomial  from  these  divided  differences.  The  problem  of 
mapping  the  algorithm  for  calculating  these  differences  into 
a  systolic  array  structure  has  been  recently  addressed  in 
[fief.  10].  In  that  paper,  li  and  Smith  propose  a  systolic 
architecture  consisting  of  triangular  cells.  Some  upward, 
some  downward,  according  to  that  paper,  have  the  same 
internal  architecture  but  must  perform  differently  depending 
on  their  orientaticn  with  respect  to  the  data  flow. 
Certainly,  this  poses  the  problem  of  sensing  the  direction 
of  the  data  flow  and  implementing  some  additional  logic  in 
the  cells  to  change  their  function  accordingly.  Li  and  Smith 
also  discuss  some  implementation  problems  in  their  paper, 
and  pcint  out  that  their  actual  cell  "is  not  exactly  the 
same  as  described"  [fief.  10:  page  542]. 

Their  algorithm  is  implemented  in  SYSGBAS,  but  the  basic 
cell  has  been  modified;  grouping  one  upward  triangle  and  cne 
downward  triangle  into  the  same  cell.  It  is  noticed  that  the 
upward  cell  was  in  fact  actuating  just  as  a  traf fic-orientor 
buffer  and  no  additional  information  was  being  incorporated 
into  the  data  flow.  It  wouldn't  be  cost  effective  to  have 
that  kind  of  processor  since  the  introduced  complexity  was 
due  to  the  need  to  establish  a  homogeneous  type  of  architec¬ 
ture  based  on  triangles.  Therefore,  our  basic  cell  became 
that  shown  in  Fig.  4.1,  and  the  systolic  architecture. 


x2 (o)  «  x2il> 

I 

r  a  Cy2Ci;  -  yiUJJ  /  C*Zli>  - 
yl loJ  *  r 
y2(o>  *  *• 

Legend:  r  *  residue 


Figure  4. 1  Divided  Differences  Cell  processor 

aleost  tbe  sane  as  tbat  proposed  by  Li  and  Smith,  appears  in 
Fig.  4.2. 

fle  have  investigated  the  probles  of  finding  the  divided 
differences  of  the  set  of  points  (1.0, 1.0),  (1.8, 2. 2), 

(3. 0,3. 3),  (4.3,3. 5)  and  (5.3,4. 1)  in  the  x-y  plane. 
According  to  [Ref.  10s  page  539]#  the  first  level  divided 
differences  are 

I'(1)  *  (y<2)  -y  (1)  )/(x  (2)  -x  (1) )  a 

*  (2.2-1.0)/(1.8-1.  0)  *  1.5000 
JM2)  *  (y  (3)  -y  (2) )  /(x(3)  -X  (2) )  =0.91666 
y*  (3)  *  (y  (4)  -y  (3) )/  (x  (4)  -x  (3) )  =  0.  15385 
y*  (“)  *  (y(5)  y  («))/(x(5)-x(4>)  =  o.eooo 
Continuing  with  the  calculations, the  second  level 
divided  differences  are 


-0.29167 

-0.30512 

0.19398 


Figure  4.3  Divided  Differences  Array  after  Clock  Cycle  2 


and  the  third  level  divided  differences  are 

-0. 00408 

0.14260 

and  the  fourth  level  is  0.03411.  This  suggests  a  kind  of 
pyramid  structure  that  can  be  seen  from  Figs.  4.2  up  to 
4.5.  The  calculation  of  the  differences  is  performed  in 
four-clock  cycles,  cne  represented  in  each  figure.  The 
bottom  row  displaying  the  first  order  differences,  the  upper 
rov,  the  second  order  and  so  on  till  the  top  cell.  He  have 
related  each  point  (F1-P5)  to  a  primary  color,  that  is,  PI 
is  blue,  P2  is  red,  P2  is  green,  P4  is  blue,  and  P5  is  red. 
In  Fig.  4.2  ve  can  see  the  result  of  the  first  step  of 
calculations,  with  the  mixing  of  these  colors  at  the  bottom 
cells.  Going  through  each  step  and  comparing  them  with  the 
mathematical  formulation,  we  get  a  better  understanding  of 
the  data  interaction.  The  the  final  results  in  Fig.  4.5 
should  be  compared  with  the  above  calculated  values. 

B.  MATRIX-MATRIX  HDI1IP1ICATICH 

This  is  another  simulation  problem  that  have  been  tried 
on  SYSGRAS.  The  presentation  of  the  algorithm  has  teen  dene 
at  [Ref.  Is  page  9].  Presented  in  Fig.  4.7  is  the  arrange¬ 
ment  cf  the  systolic  array  and  how  the  data  is  pumped  into 
it.  The  elements  of  matrix  A  can  be  seen  flowing  towards  the 
lower  right  and  those  cf  matrix  B  flowing  towards  the  lower 
left.  The  resultant  matrix  C  is  pumped  upwards.  We  have 
performed  a  simulation  to  compute  the  matrix  product  AB=c  as 
below 


1  9 

7  4 

10  6 


8  15  17 

29  63  70 

22  52  98 


The  scheme  used  for  systolic  computation  has  an  advan¬ 
tage  due  to  the  fact  that  a  great  number  of  matrices  that 
are  used  in  typical  problems  are  band  matrices  [Bef .  9]. 
Consequently,  the  systolic  array  does  not  need  to  include  as 
many  cells  as  it  would  have  to  for  the  case  of  full 
matrices.  In  this  numeric  example,  in  order  to  restrict  cur 
array  to  a  small  size  so  that  it  can  be  seen  on  the  screen 
(as  in  Pig.  4.8),  we  decided  to  set  elements  a  (1,3)  and 
b(3,1)  tc  zero.  This  way,  a  systolic  array  of  small  dimen¬ 
sions  can  multiply  larger  matrices  with  restricted  nonzero 
band. 

A  single  type  of  cell  is  used  in  this  algorithm.  It  is 
called  Inner  Product  Step  Processor.  Its  geometry  and  algo¬ 
rithmic  definition  axe  shown  in  Fig.  4.6  .  This  same  type  of 
cell  will  be  presented  later  in  another  algorithm 
implementation. 

In  this  simulation  problem,  different  colors  are  attrib¬ 
uted  to  different  rows  of  matrix  A  and  to  different  columns 
of  matrix  B.  As  we  know,  each  element  of  the  product  matrix 
C  will  be  generated  by  a  combination  of  one  row  of  A  and  one 
column  of  B.  When  these  elements  join  each  other  in  a 
systolic  cell,  we  will  be  able  to  identify  exactly  that 
combination  taking  place  at  each  cell  in  the  space-time 
frame.  We  present  here  the  coding  of  colors  that  was 
adopted : 


blue  blue  blue 


green  green  green 


blue  red  green 
blue  red  green 
blue  red  green 


blue  magenta  cyan 
magenta  red  yellow 
cyan  yellow  green 


i np ( 1 ) 

.out  (3) 

\  / 

*~\  out(l)  »  inpu;  i  feedthrough 

/  \  »np(2>  out(2)  ,  inp(2)  ;  feedthrough 

out  (2)  \  /  p  a  inP{1*  *  iop(2>  ♦  inp(3) 

j  ^  \ _ 

/  put  12)  ■  r  ;  processed  output 

i 

I  inp(3) 

1  out ( l ) 

Figure  4.6  Inner  Product  Step  Processor  Cell 


To  show  how  colors  can  help  in  understanding  the  mecha- 
nisa  of  multiplication  in  the  systolic  array,  we  will  track 
the  generation  of  the  element  c(3,2).  Re  see  fron  the  above 
color  coding  that  c{3,2)  is  a  yellow  element,  since  it 
results  from  the  combination  of  a  green  row  and  a  red 
column.  From  mathematics, 

c  (3,2)  =  a{3,1)  .b  (1,2)  ♦  a (3,2)  . b  (2,2)  ♦  a  (3,  3) .  b  (3, 2) 

Figure  4.7  shows  a  schematic  diagram  that  corresponds  to 
clock  cycle  number  1,  the  same  cycle  is  also  shown  in 
Fig.  4.8,  in  which  elements  a  (1,1)  and  b  (1 , 1)  have  entered 
cells  numbers  14  and  15  respectively.  From  Fig.  4.7  it  can 
also  te  seen  that  the  element  a  (3,1)  (green  element)  will 
enter  into  the  array  (at  cell  06)  at  clock  cycle  3  (see 
Fig.  4.7),  while  element  b(1,2)  (red  element)  will  enter  at 
clock  cycle  2  (at  cell  12)  (see  Fig.  4.9).  After  that,  they 
will  run  through  the  array  and  will  meet  each  other  at  cell 
02  at  clock  cycle  5  (see  Figure  4.12).  This  meeting  gener¬ 
ates  the  first  partial  result  a  (3, 1) .  b  (1,2)  .  This  can 


figure  4.10  Batrix-Batrix  Bultiplication  after  Clock  Cycle  3 
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te  verified  to  be  8x1=8  by  oar  numeric  exanple  and  by  the 
contests  of  cell  02  at  clock  cycle  5  .  This  partial  result 
with  yellow  color  (identified  as  c  (3 ,2)  for  convenience)  is 
puaped  upwards  to  cell  09.  Perforaing  similar  verification, 
it  can  be  seen  that  elements  a  (3,2),  b  (2 ,2)  and  that  partial 
result  of  g(3,2)  will  meet  at  cell  09  at  clock  cycle  6  (see 
Pig.  h.13) .  They  will,  then,  generate  a  second  partial 
result 

a(3,1)  .b(1,2)  +  a  (3, 2)  .b  (2,2) 

which  is  8x1+2x7=22  as  seen  in  Fig.  4.13  at  cell  09.  By 
similar  reasoning,  the  final  result  8x1+2x7+3x10=52  is 
generated  in  cell  14  at  clock  cycle  n  (see  Fig.  4.14). 

It  can  be  seen  that  matrix  C  is  symmetrical  with  respect 
to  the  color  coding.  Symmetrical  elements  like  C(3,2)  and 
c(2,3)  are  computed  in  parallel  and  pumped  up  side  by  side. 
So,  they  are  easily  tracked  with  the  help  of  colors,  and  the 
whole  operation  can  be  better  visualized. 

C.  MATBIX  TRI ANGULABIZATIOH  BX  GIVENS  ROTATIONS 

Be  will  provide  a  brief  review  of  this  problem.  Ihe 
nomenclature  to  be  used  is  the  same  as  that  in  Chapter  III. 
Ihe  reader  is  also  referred  to  Appendix  C,  "A  Numerical 
Example  for  Matrix  Triangularization”,  which  uses  the  same 
data  as  that  described  here. 

The  linear  system  that  needs  to  be  solved  is 

AX  =  B 

where  A  is  a  nxp  matrix,  X  is  a  column  vector  of  p  elements, 
and  E,  a  column  vector  of  n  elements. 

Be  will  premultiply  the  matrix  A  by  a  transformation 
matrix  Q  such  that  the  product  matrix  becomes  an  upper 
triangular  matrix  R.  To  keep  the  matrix  equality  it  is 
necessary  to  premultiply  vector  B  by  matrix  Q.  This  will 


result  id  the  product  vector  QB.  This  operation  is  shown 
below: 

QAX  =  QB 

and  as  B  =  QA,  it  becomes 
RX  =  QB 

The  Givens  Rotations  in  this  algorithm  under  simulation 
triangularize  the  matrix  A  and  computes  the  vector  QB 
concurrently.  This  is  possible  because 

Q(A|B)  =  CQA)  |  (QB)  =  R|  (QB) 

where  the  operator  "  |"  performs  matrix  concatenation.  For 
clarity,  if  we  define 


■  2  4  1 

A  -  5  7  4 

.301 


and  B  = 


then  A | B  becomes 


A  |  B  = 


2  4  1  12 

5  7  4  3 

3  0  18 


As  we  have  shown  in  Chapter  III,  the  simulation  uses  the 
cell  structure  seen  in  Fig.  3.4  and  the  test  data  seen  in 
Figure  3.5  (the  same  as  that  used  in  the  numerical  example 
cf  Appendix  C)  .  The  picture  that  appears  on  the  screen  at 
the  start  of  the  simulation  is  shown  in  Fig.  4. 17  which  the 
reader  should  compare  with  Fig.  3.4  in  Chapter  III. 
Fig.  4.17  shows  the  identification  of  each  cell  at  its  lower 
right  corner.  Cells  numbered  13  to  10  represent  the  data 
wavefront  to  be  pumped  into  the  systolic  array.  They  are  not 
part  cf  the  array,  but  only  buffer  cells  to  allow  presenta¬ 
tion  of  the  data  to  be  pumped  into  the  array  on  the  screen. 


The  final  elements  of  vector  Q B  are  calculated  in  cells  04 , 
02,  and  01.  The  final  elements  of  matrix  A  are  calculated  in 
cells  numbered  09,  08,  06,  07,  05  and  03. 

As  shown  in  Appendix  C,  the  values  for  AX  =  B  are  as 
follows  and  A| B  is  as  shown  above. 

‘  2  4  1  x (1)  12 

5  7  4  .  x{2)  =  3 

.  3  0  1  x{3)  8  . 

The  test  data  are  pumped  into  the  array  as  wavefronts 
that  can  be  seen  in  cells  13  to  10  just  before  being  entered 
(the  reader  can  refer  to  Fig.  3.5  to  the  several  wavefronts, 
each  one  corresponding  to  a  different  color) .  There  is  a 
time  shift  among  the  elements  of  the  same  row  for  the  matrix 
under  triangularization  (compound  matrix  A|B).  This 
displacement  is  needed  to  establish  the  correct  timing  for 
the  systclic  operation.  In  this  simulation,  we  decided  to 
assign  a  different  color  for  each  row  of  A|3  (as  shewn  in 
Fig.  3.5)  and  to  keep  track  of  the  flow  of  the  colored 
elements  through  the  array.  This  flow  can  help  us  to  under¬ 
stand,  in  a  sequence  of  space-time  frames,  how  the  whole 
array  behaves.  The  effect  of  the  different  clock  for  subseq¬ 
uent  cycles  can  be  seen  from  Fig.  4.18  to  4.26  .  In 

Fig.  4.18,  element  a  (1,1)  of  matrix  A  enters  into  cell  09. 
He  can  see,  in  cell  13  element  a  (2,1),  and  in  cell  12 
element  a  (1,2).  They  are  ready  to  be  clocked  in  at  the  next 
clock  cycle.  This  happens  as  seen  in  Fig.  4.19,  when  cells 
09  and  G8  display  the  stored  result  of  the  computation 
performed.  Notice  that  cells  09,  07,  and  03  actuate  like  a 
reflector  for  the  data  wavefront  that  is  going  downwards. 
They  rotate  the  data  flow  direction  by  90  degrees  counter¬ 
clockwise.  As  a  result  of  the  interaction  between  the  wave¬ 
fronts  that  go  towards  the  right  and  that  going  downwards, 
there  is  a  resulting  wavefront  that  flows  towards  the  lower 


right.  This  is  the  important  observation  because  it  shows 
how  the  effect  of  the  input  data  spreads  over  the  array 
(this  is  why  we  decided  to  associate  this  wavefront  with 
identical  colors) .  The  wavefront  flowing  towards  tbe  right 
carries  information  about  the  angle  that  the  row  elements 
must  he  rotated.  This  information  must  arrive  at  each  cell 
just  in  phase  with  information  of  the  matrix  element  being 
rotated,  which  is  being  transferred  to  each  cell  by  the 
downward  wavefront.  At  each  clock  cycle  a  new  rotation  angle 
is  calculated  at  the  inclined  boundary  cells  (Givens 
External  Cells,  namely  09,  07  and  03)  and  transmitted  tc  the 
rightest  cells  at  the  same  row  (that  is  from  cell  09  to 
cells  C8,  06  and  04  in  the  upper  row,  from  cell  07  to  cells 
05  and  02  in  the  seccnd  row  and  from  cell  03  to  cell  01  in 
the  lower  row) .  Partial  results  are  generated  at  each 
systolic  array  element  at  each  clock  cycle.  At  the  mcment 
when  the  downward  wavefront  start  feeding  zeros  into  a 
systolic  array  element,  it  freezes  its  content  and  do  not 
modify  it  any  more.  He  have  selected  the  black  color  to 
indicate  that  the  input  data  are  bringing  no  information. 
The  black  wavefront  flowing  through  the  systolic  array  acts 
as  a  freezing  wavefccnt.  Figs.  4.19  up  to  4.26  show  the 
effect  of  the  remaining  clock  cycles.  In  Fig.  4.26  we  have 
the  final  result  of  the  triangularization  frozen  into  the 
cells.  The  upper  triangular  matrix  R  and  the  vector  QB  have 
teen  computed.  It  can  be  checked  out  against  those  values 
shown  in  Appendix  "A  Numerical  Example  for  Matrix 
Triangularization".  Ihe  computed  results  are  ready  to  be 
used  in  the  Back  Substitution  to  solve  the  system  equations. 
In  order  to  transfer  the  data  from  the  cells  to  the  hardware 
that  performs  the  Back  Substitution  operation,  a  special 
connection  which  is  net  shown  here  becomes  necessary.  But, 
it  is  not  required  here  for  the  understanding  of  the  Givens 
Rotations  process. 


He  will  now  present  a  different  kind  of  perspective  to 
the  reader  on  how  the  algorithm  performs.  The  geometrical 
interpretation  of  Givens  Rotations  is  extremely  helpful  to 
provide  tetter  insight.  A  matrix  can  be  considered  as  repre¬ 
senting  a  set  of  vectors  (as  many  as  the  number  of  columns) 
in  a  n-dimensional  space  (n  equals  the  number  of  rows) .  If 
the  matrix  has  only  three  rows,  their  columns  are  vectors  of 
a  three  dimensional  space,  and  the  elements  of  the  first  row 
are  the  components  ever  the  x-axis  of  the  column  vectors. 
The  rctaticn  operation  does  not  rotate  the  vectors.  The 
reference  frame  (cocrdinate  axes)  is  the  one  that  is 
rotated,  and,  as  a  consequence,  the  description  of  the 
vectors  in  terms  of  their  components  with  respect  to  the  new 
reference  becomes  different.  When  a  matrix  is  rotated  to 
become  an  upper  triangular,  the  first  column  is  transformed 
in  such  a  way  that  orly  element  (1,1)  will  be  nonzero.  This 
means  that  the  first  column  vector  is  positioned  over  the 
x-axis  ir  the  new  reference  frame.  As  the  vector  is  the  same 
as  before,  the  numerical  value  of  the  new  element  (1,1)  must 
be  egual  to  the  length  of  the  vector.  Element  (1,2)  of  the 
rotated  matrix,  for  example,  is  the  x-component  of  the 
second  column  vector  in  the  new  reference  frame.  Since  the 
relative  position  of  both  vectors  is  unaltered,  the  new 
value  of  that  element  must  be  equal  to  the  projection  of  the 
second  column  vector  over  the  first  column  vector,  since 
this  last  one  is  now  along  the  new  x-axis.  The  geometrical 
interpretation  can  be  extended  to  all  elements  of  the  array. 

He  will  track  tie  build  up  of  the  numerical  values  of 
cells  09  (that  stores  element  (1,1))  and  08  (that  stores 
element  (1,2)).  This  allows  us  to  study  the  operation  of 
both  tjpes  of  cells  used  in  this  algorithm  (cell  09  is  an 
outer  cell  and  cell  08  is  an  inner  cell) .  It  will  also 
provide  a  comparison  between  the  geometric  interpretation 
presented  above  and  the  algebraic  values  shown  in  graphics. 
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if  *  *  o  tn«n 
begi  n 

c  =  1  •  0 

s  »  0.0 
end 

else  basin  _ 


temp  =*  vr*  +  z* 
car/  tamp 
s  *  *  /  temp 
r  =  temp 


Legend:  r  a  residue 


Figure  4.27  Givens  External  Processor  Cell 


We  will  start  with  cell  09.  From  Fig.  4.27  we  see  that 
this  cell  receives  the  elements  of  the  "nonrotated"  first 
column  vector,  computes  the  rotation  angle  (calculating  its 
outputs  cosine  c  and  its  sine  s) ,  and  stores  the  "rotated" 
x-component  r  of  the  first  vector.  The  other  components  are 
obviously  zeros  and  correspond  to  the  other  elements  of  the 
first  column  (cells  cf  the  lower  array  that  are  not  shown 
because  their  contents  are  always  zeros) .  The  rotation 
angle  information  is  passed  as  output  to  cell  08  to  be  used 
to  "rotate"  the  second  vector.  Table  I  shows  how  the  param¬ 
eters  r,  c,  and  s  of  this  cell  respond  to  the  external 
inputs.  These  values  can  be  verified  with  the  help  of  the 
cell  algorithm  presented  in  Fig.  4.27  .  The  description  cf 
the  operation  can  ke  followed  through  the  sequence  of 
Figs.  4.28  to  4.30. 


TABLE  I 

Time  Description  of  Outer 

Cell  Operation 

Cycle 

Input  z 

Output  c 

Output  s 

Residue  r 

0 

0.0 

[black) 

1.000 

0.000 

0.000 

1 

2.0 

blue) 

0.000 

1.000 

2.000 

2 

5.0 

|red) 

0.371 

0.928 

5.385 

3.0 

green) 

[black) 

0.874 

0.487 

6.  164 

4 

0.0 

1.000 

0.000 

6.164 

At  clock  cycle  0  no  input  has  been  received  and  so  there 
is  nothing  to  rotate.  The  rotation  angle  is  zero  (c=1.0  and 
s=0.0).  At  cycle  1,  the  first  element  is  received.  The  first 
element  is  the  old  x-component  of  the  first  column  vector. 
As  it  is  over  the  x-axis,  the  reference  does  not  need  to  be 
rotated  now.  However,  when  this  occurs,  because  of  the  way 
the  algorithm  is  implemented  {we  will  see  the  reason  when 
analysing  cell  08),  an  angle  90  degrees  is  informed  to  cell 
08  (c=0.0  and  s=1.0)  although  the  value  r  stored  in  cell  09 
is  2.0,  egual  to  the  input.  At  clock  cycle  2,  the 
y-component  of  the  old  first  column  vector  is  entered.  Now  a 
rotation  is  necessary  to  keep  the  x-axis  of  the  reference 
frame  over  the  first  column  vector.  The  rotation  angle  is 
computed  (c=0.371  and  s=0. 928)  and  the  reference  frame  is 
rotated  (about  the  z-axis).  The  new  x-component  (stored  in 
r)  is  the  square  root  of  the  summation  of  the  squares  of  the 
old  x  and  y-components,  that  is  5.385  (corresponding  tc  the 
projection  of  the  first  column  vector  over  the  x-y  plane)  . 
At  cycle  3,  the  z-component  is  entered.  As  the  last  rotation 
resulted  in  a  vector  over  the  x-axis,  the  combination  of 
this  with  the  just  entered  z-component  will  result  in  a 
vector  in  the  x-z  place  deviated  from  the  x-axis  because  of 
the  newly  arrived  z-ccmponent.  Again  a  rotation  is  necessary 


Figure  4.28  Beference  Frame  before  Rotations 


to  keep  the  x-axis  over  the  first  column  vector.  The  new 
x-component  is  6.164.  At  clock  cycle  4  a  black  datum  is 
received  {a  zero  value)  and  that  freezes  the  contents  of 
cell  09  with  its  final  value.  Bo  more  rotations  are  required 
and  the  output  of  the  cell  informs  rotation  angle  zero 
(c=1.0  and  ss0.0)  . 

Bcw  we  will  follow  the  build  up  of  the  contents  of  cell 
08  (the  right  side  neighbor  cell  of  cell  09).  Cell  08  stores 
element  (1,2) ,  the  x-component  of  the  second  column  vector. 
As  done  before,  we  present  table  II  with  inputs  and  values 
stored  in  this  cell  at  each  clock  cycle.  In  this  case  we 
will  disregard  the  numerical  output  of  this  cell  because  we 
will  concentrate  on  the  build  up  of  the  residue  r  in  this 
cell.  From  Fig.  4.31,  this  cell  receives  as  input  the  rota* 
tion  angle  (inputs  c  and  s)  ,  by  which  the  reference  frame 
has  changed,  to  be  used  to  compute  the  new  x-component  of 
the  second  column  vector  (to  be  stored  at  r) .  The  ctber 
input  refers  to  the  elements  of  the  "nonrotated"  second 


Figure  4.29  Reference  Fraae  after  First  Rotation 


Figure  4.30  Reference  Frane  after  Second  Rotation 


figure  4.31  Givens  Internal  Processor  Cell 


column  vector  (input  z) .  The  outputs  of  this  cell  are  the 
rotation  angle  (c  and  s  are  passed  as  output  to  the  right 
neighbor  cell  05)  and  rotation  information  to  the  neighbor 
cell  iaaediately  below  cell  08  to  "rotate"  the  other  compo¬ 
nents  of  the  coluan  vector  as  a  result  of  the  reference 
frame  rotation. 


The  operation  of  this  cell  is  more  difficult  tc  visu¬ 
alize.  Figure  4.32  will  be  used  for  the  explanation.  It  only 
shows  the  x-y  plane.  Suppose  the  second  column  vector  is  A, 
as  seen  in  that  Figure.  The  cld  reference  frame  is  xl-yl. 
The  ccapcnents  of  A  with  respect  to  that  frame  are  ax  and 
ay.  In  our  numerical  example,  ax=4.0  is  the  first  input 
z(i)  to  the  cell.  The  second  input  z(i)  to  the  cell,  on  the 
following  cycle,  is  aj*7.0.  Suppose  the  reference  frame  is 
rotated  about  the  z-axis  to  position  x2-y2  to  keep  the 
x-axis  along  the  first  column  vector,  which  is  shown  as  9. 
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figure  4.32  Geometric  Interpretation  of  a  Rotation 


Instead  of  evaluating  the  coapcnents  of  A  in  this  new  frame, 
let  us  study  the  effect  over  its  components  ax  and  ay.  In 
this  new  frame,  ax  will  be  decomposed  into  components  axx 
(over  the  new  x-axis)  and  axy  (over  the  new  y-axis)  . 
Similarly,  ay  will  be  decomposed  into  components  ayx  (over 


the  nev  x- axis)  and  ayy  (over  the  new  y-axis) .  The  component 
of  A  over  the  nev  x-axis  is  the  suuation  of  axx  and  ayx. 
This  will  he  the  nev  eleaent  (1,2).  is  seen  in  Fig.  4.32, 

axx  -  ax  .  cos(alfa)  and  ayx  =  ay  .  sin(alfa) 

and  as  element  (1,2)  is 

r  =  axx  ♦  ayx 


it  results 


r  *  ax  •  cos(alfa)  +  ay.  sin(alfa) 

As  ax  was  the  previous  r  and  ay  is  the  nevly  arrived  input 
z(i),  in  computer  algorithmic  language  we  have 

r  *  c(i)  *  r  ♦  s(i)  *  z  (i) 

as  in  Fig.  4.31  .  Substituting  the  values  of  clock  cycle  2 
into  this  equation,  we  find 

r  *  0.0  *  0.0  ♦  1.0  *  4.0 

r  *  4.0 

and  this  can  be  checked  against  the  above  table.  At  clock 
cycle  3,  the  values  give  us 

r  =  0.371  *  4.0  ♦  0.928  *  7.0 

r  =  7.985 

that  also  checks  against  the  above  table.  Doing  the  same  for 
clock  cycle  4,  ve  get  r=6.976.  This  gives  the  final  value  of 
the  x-ccmponent  of  the  second  column  vector.  The  rotation, 
however,  will  affect  all  other  components.  Let  us  describe 
how  this  occurs.  The  component  of  A  over  the  nev  y-axis  is 

z  (o)  *  ayy  -  axy 

z  (o)  *  ay  .  cos(alfa)  -  ax  .  sin  (alf a) 


3 


and  in  algorithmic  language, 

2(0)  »  c(i)  *  z  (i)  -  s(i)  *  r 

This  inf creation  z(c)  is  passed  to  the  cell  iamediately 
below  to  generate  the  new  y-coaponent  of  the  second  colunn 
vector. 

A  farther  point  to  be  noticed  is  that  if  the  rotation 
angle  inpat  to  an  inner  cell  is  zero  (c=1.Q  and  s=0.0),  the 
cell  will  not  change  the  stored  value  at  r.  This  is  why  the 
rotation  angle  is  inforaed  as  90  degrees,  as  aentioned 
previously,  when  the  first  element  is  received  at  cell  09. 
This  triggers  cell  C8  to  receive  and  store  the  z  input  at 
the  following  clock  cycle.  At  the  end  of  clock  cycle  02,  the 
z=4. 0  input  is  stored.  No  rotation  is  performed,  so  r=4.0. 
The  following  clock  cycles  impose  the  modification  of  the 
contents  of  this  cell  because  of  the  changes  in  the  refer¬ 
ence  frame.  As  soon  as  the  input  z  freezes  (at  clock  cycle 
5) ,  the  residue  r  cf  the  cell  also  freezes.  It  can  be 
observed,  if  the  outputs  of  cell  09  and  its  inputs  to  cell 
08  are  compared,  the  transamission  delay  of  one  clock  cycle 
from  a  cell  to  another.  It  can  also  be  noticed  that  the 
inputs  to  cell  08  always  have  the  same  color.  This  simula¬ 
tion  was  purposely  designed  this  way  to  show  the  need  for 
sinchronization  between  the  wavefronts  that  propagate 
through  the  array. 

Certainly  we  cannot  present  all  this  complexity  with  the 
simulator.  The  algorithm  of  Givens  Rotations  is  the  most 
complex  that  we  have  traced  in  our  survey.  However,  the 
simulator  has  been  a  fundamental  tool  to  perform  a  numerical 
study  and  to  verify  the  correctness  of  the  algorithm. 


0.  B1CK  SOBSTITOTIOl 


le  will  use  the  results  of  the  previous  section  as  input 
data  for  the  simulation  described  in  this  section.  This  way, 
with  these  two  sections,  we  can  go  through  the  complete 
problem  described  in  the  Appendix  "A  Numerical  Example  for 
Matrix  Triangularization" .  That  is,  to  search  for  the  vector 
X  in  matrix  equation  AX=B.  The  process  of  Back  Substitution 
is  described  in  [Bef.  7:  page  24]. 

One  of  the  cells  used  in  this  algorithm  is  shown  in 
Figure  4.33  .  The  above  reference  presents  the  cell  shown 
in  Fig.  4.33  as  a  circular  shaped  cell  named  Back 
Substitution  Cell.  Since  SYSGRAS  would  take  too  much  tine  to 
draw  circular  shapes,  an  octogonal  shape  is  used  instead 
(see  cell  number  05  in  Figure  4.37). 

The  other  type  cf  cell  used  in  this  algorithm  is  the 
same  Inner  Product  Step  Processor  that  was  presented  in 
Fig.  4.6  to  do  Matrix-Matrix  Multiplication.  This  is  an 
interesting  aspect  of  systolic  arrays:  many  algorithms  that 
appear  in  £h§  literature  are  implemented  wjth  the  same  types 
of  cells.  However,  the  geometry  of  the  array,  in  this  algo¬ 
rithm,  requires  the  cell  to  be  square  shaped.  Although  the 
cell  is  functionally  the  same  as  shown  in  Fig.  4.6,  we 
present  it  again  in  Figure  4.34  to  natch  the  way  Fig.  4.35 
represents  it  as  part  of  the  structure  (see  cells  04  and 
02) . 

The  mathematical  background  for  solving  a  triangular 
linear  system  involving  a  lover  triangular  array  has  been 
presented  by  Rung  in  [Ref.  Is  page  19].  We  will  adopt  here 
his  explanation. 

Suppose  the  system  equation  to  be  solved  is 
RX  =  D 


where  R  =  (r(i,j))  is  a  nonsingular  nxn  lower  triangular 
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Figure  4.33  Back  Substitution  Main  Cell 
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Figure  4.34  Inner  Product  Step  Processor  Cell 

aatrix  and  0  is  a  a  n-vector,  both  being  given.  To  compute 
the  vector  1,  we  can  use  Forward  Substitution  as  follows; 

k  ;=  1 
Y(i,1)  *•=  0 
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repeat 

y  (i,k*1)  :=  y(i,k)  ♦  r(i,k)  .  x(k) 
k  :=  k  +  1 
until  k  =  1 

x(i)  :=  (d  (i)  -  y(i#i))  /  r(i,i) 

The  above  algorithm  can  be  used  to  calculate  the 
elements  cf  vector  X  in  the  seguence  x(1),  x(2),  ...  and  so 
on.  Ibis  algorithm  can  be  implemented  in  a  systolic  array  as 
will  be  shown.  Y  =  (y(i,k))  is  a  vector  of  partial  results 
that  allcvs  the  recurrence  to  build  up. 

In  our  case,  since  the  interest  is  in  Back  Substitution, 
we  will  simply  enter  the  data  into  the  systolic  array  in  an 
order  reverse  to  that  proposed  by  Rung  in  his  paper.  Let  us 
make  this  point  more  clear.  We  do  Forward  Substitution  when 
we  have  a  lower  triangular  array.  In  this  case  we  solve 
first  for  x(1),  next  for  x  (2)  and  so  on.  Following  the 
seguence  proposed  by  Rung,  vector  X  should  be  pumped  into 
the  systolic  array  as  x(1)  ,  x(2)  and  x  (3)  .  The  same  for 

vector  QB.  Since  here  we  are  doing  Back  Substitution,  we 
enter  the  vector  X  in  a  reverse  seguence  x  (3)  ,  x(2),  and 

x(1) .  QE  is  also  entered  the  same  way.  The  way  matrix  R 
being  pumped  into  the  systolic  array  is  modified  with 
respected  to  that  proposed  by  Rung.  For  instance,  the  first 
of  its  elements  to  be  pumped  into  the  systolic  array  is 
r(3,3)  (see  cell  10  in  Fig.  4.36)  which  is  reguired  to 
compute  x(3).  In  the  Forward  Substitution  Hethod,  the  first 
element  pumped  should  be  r(1,1)  to  compute  x(1).  The  ether 
elements  of  R  are  rearranged  accordingly. 

The  system  eguaticn  to  be  solved  in  this  simulation  is 

EX  =  QB 

where  R  is  an  upper  triangular  matrix  and  QB  is  a  n-vector 
that  resulted  from  the  transformation  seen  in  the  previous 
Secticn. 


r«,a) 


Figure  4.35  Data  krrangeaeat  for  Back  Substitution 


Figure  4.35  presents  the  arrangement  of  the  data  with 
respect  to  the  systolic  array.  The  data  elements  are  drawn 
in  such  a  way  that  the  required  data  synchronization  to 
perform  the  operation  is  evident.  Vector  QB  and  matrix  B 
actually  carry  data  into  the  systolic  array  for  processing. 
Vector  Y  enters  the  array  bringing  no  data.  Its  elements  are 
all  zeros  at  the  moment  represented  in  Fig.  4.35  .  Its 
values  are  modified  as  it  flows  through  the  array.  Vector  X 
is  generated  into  cell  05f  the  Back  Substitution  Cell,  and 
its  elements  are  pumped  through  the  array  in  a  direction 
opposite  to  that  of  vector  Y.  As  they  go  through  cells  04 
and  02,  they  combine  with  elements  from  matrix  B  to  build  up 
the  elements  of  vector  Y.  Finally,  X  will  come  out  with  its 
final  value  when  it  leaves  the  array,  from  cell  number  02  of 
Fig.  4.35. 

Figures  4.35  and  4.36  should  be  compared.  This  last  one 
presents  the  same  arrangement  as  the  former,  but  it  shows 
how  the  picture  appears  on  the  screen.  Matrix  B  is  on  the 
upper  block  of  the  cells  (cells  number  07  to  21}  (Refer  to 
Appendix  C  to  compare  the  numerical  values) .  This  block  dees 
not  belong  to  the  array  itself  and  is  used  only  for  presen¬ 
tation  of  the  data.  The  systolic  array  is  represented  by 
cells  05,  04,  and  02.  The  elements  of  vector  QB  are  intro¬ 
duced  from  its  left  side  (at  cell  06).  The  vector  of 
partial  results  Y,  a  string  of  zeros  separated  by  one  deck 
cycle  delay,  is  pumped  in  from  the  right  side  (at  cell  03)  . 
The  output  that  will  collect  the  solved  elements  of  vector  X 
is  represented  by  cell  01.  Attention  to  the  fact  that  the 
numbers  that  are  displayed  at  cells  04  and  02  are  the  values 
assumed  by  the  elements  of  vector  Y  as  they  flow  leftwards 
through  the  array.  The  number  displayed  in  cell  05  is  the 
element  cf  vector  X  being  computed  and  that  in  cell  01  is 
the  element  of  X  being  pumped  out  to  the  external  world. 
Another  point  to  notice  is  the  way  the  bidirectional 


connections  between  cells  05  and  04,  and  04  and  02  is  imple- 
nented  by  SYSGRAS.  Only  one  bidirectional  arrow  is  used. 

Re  decided  to  associate  the  elements  of  each  column  of 
the  matrix  R  (third  column  in  blue,  second  in  red,  and  first 
in  green)  and  the  corresponding  row  elements  of  vectors  X 
and  QB  with  the  same  color.  (e.g.  the  element  number  3  of 
the  QE  will  interact  only  with  the  column  number  3  of  the 
matrix  R) .  Cell  processor  number  5,  the  so  called  Back 
Substitution  Cell  is  shaped  differently,  as  mentioned 
earlier,  to  emphasize  its  special  purpose  in  the  systolic 
array.  This  cell  processor  computes  the  elements  of  vector  X 
from  the  received  data  and  pumps  them  out  to  cell  04  to  be 
utilized  for  calculation  of  the  previously  referred  partial 
results.  Hhen  the  X  elements  complete  their  trip  through 
the  array,  they  are  pumped  out  at  cell  01. 

The  manner  in  which  color  is  used  to  provide  information 
is  now  described.  The  elements  of  vectors  QB,  X  and  Y,  as 
well  as  matrix  R,  have  primary  colors  (red,  blue  and  green), 
jf  When  they  meet  elements  of  different  primary  colors,  the 

cell  where  this  meeting  takes  place  assumes  a  secondary 
color  that  is  the  result  of  that  combination.  The  elements 
of  vector  Y,  for  exanple,  can  be  seen  with  their  original 
1  color  in  cell  03,  before  mixing  up  with  others.  During  their 

trip  through  the  array  they  keep  that  color,  but  as  a  result 
of  their  combination  with  different  color  elements,  the  cell 
where  they  might  be  may  present  different  secondary  colors. 
However,  we  should  keep  in  mind  that  only  primary  colors 
travel  from  a  cell  to  another.  Secondary  colors  are  static. 

",  Tc  make  these  points  more  clear,  we  will  track  the 

;  formation  of  elements  x(3)  and  x  (2) .  Since  R  is  an  upper 

f 

triangular  matrix,  we  have 

x(3)  =  qb(3)  /  r  (3,3) 

1  Substituting  numerical  values,  as  shown  in  Appendix  C, 

•  x  (3)  =  -10.593  /  0.842  =  -12.571 


[»] 


As  it  has  been  pointed  out,  the  Back  Substitution  Cell,  cell 
number  05  in  Fig.  4.37,  is  responsible  for  computing  the 
elements  of  X.  That  Figure  presents  the  computation  of 
element  x(3).  Element  gb(3)  was  pumped  in  from  cell  06  (see 
Fig.  4.36)  and  element  r(3,3)  from  cell  10  (see  same 
Figure).  It  can  also  be  seen  in  Fig.  4.37  that  x(3)  is  blue 
since  it  was  formed  by  the  combination  of  gb  (3)  and  r(3,3), 
both  blue  (remember  that  the  third  column  of  B  has  been 
coded  blue).  After  its  computation,  element  x(3)  is  pumped 
through  the  array  to  cell  04  (clock  cycle  2,  Fig.  4.38),  to 
cell  02  (clock  cycle  3,  Fig.  4.39)  and  finally  to  the 
external  world  (clock  cycle  4,  Fig.  4.40)  at  cell  01.  During 
this  trip,  its  value  is  used  for  computation  of  the  elements 
of  vector  Y  which  are  necessary  for  computation  of  the  other 
elements  of  vector  X.  How  let  us  track  the  formation  of 
x(2)  . 

From  algebra,  we  have 

x(2)  =  (gb  (2)  -  x  (3)  .r  (2,3) )  /  r(2,2) 

The  partial  result  x(3).r(2,3)  is  computed  at  clock  cycle  2, 
in  cell  04.  This  cell  receives  x(3)  from  cell  05  (computed 
at  deck  cycle  1,  see  Fig.  4.37)  and  r(2,3)  from  cell  08 
(see  Fig.  4.37).  This  partial  result  is  called  y(2)  (color 
coded  red  because  it  will  contribute  to  the  formation  of 
x(2),  which  color  code  is  red).  This  time,  y  (2) ,  being  orig¬ 
inally  red,  appears  in  cell  04  as  magenta  because  the 
contribution  of  y{2)  is  activated  blue  when  it  interacts 
with  blue  r  (2, 3)  and  blue  x(3).  At  this  point  of  the  compu¬ 
tation  all  necessary  data  to  compute  x  (2)  is  available. 
Element  y  (2)  is  stored  in  cell  04,  element  gb  (2)  is  ready  to 
enter  the  array  (see  cell  06,  in  Fig.  4.38),  and  so  is 
element  r{2,2)  (see  cell  10  of  Fig.  4.38).  At  clock  cycle  3 
these  elements  are  pumped  into  the  Back  Substitution  Cell 
(cell  05)  (see  Fig.  4.39)  and  the  computation  of  x  (2)  takes 


place.  Substituting  numerical  values  (as  seen  in  these 
cells)  into  the  above  equation  it  becomes 

Z  (2)  =  (-0.566  ♦  11.  538)  /  4.042  =  2.714 

Element  x(2)  assumes  color  red  because  all  factors  that 
contributed  to  its  formation  had  that  color  code.  After  its 
computation,  x(2)  is  pumped  through  the  array  via  cell  04 
(clock  cycle  4,  Fig.  4.40),  cell  02  (clock  cycle  5, 
Fig.  4.41)  and  finally  pumped  out  to  cell  01.  During  this 
trip,  similarly  for  x<3),  it  contributes  to  the  formation  of 
y(1),  another  element  of  the  vector  of  partial  results.  This 
will  be  required  to  the  computation  of  x(1). 

The  sequence  of  pictures  from  Fig.  4.36  to  4.44  displays 
the  whole  computational  process  in  the  systolic  array 
according  to  the  algorithm. 

E.  FBBTBEB  EXPLOBATIC1S 

The  potential  of  SYSGRAS  goes  far  beyond  the  implementa- 
tion  of  systolic  algorithms.  Presently,  because  of  pratical 
importance,  extensive  research  is  being  carried  out  on  the 
investigation  of  faults  in  the  actual  systolic  devices 
£Bef.  11].  How  to  circumvent  those  faults  and  which  effect 
they  do  have  on  the  results  are  some  of  the  subjects  under 
study.  STSGBAS  can  be  used  as  a  valuable  tool  in  performing 
such  studies,  because  of  the  simplicity  of  its  interface 
with  the  user  and  because  of  the  possibility  of  inclusion  of 
possible  subroutines  to  support  the  algorithms.  To  emphasize 
its  versatility,  we  address  the  fact  that  a  processor  cell 
may  have  a  large  number  of  input/output  ports  (although  it 
is  presently  set  up  for  a  maximum  of  four) .  This  allows  a 
large  number  of  connections  with  other  cells  or  the  outside. 
Ibis  characteristic  cat  be  used,  as  example,  to  inject  data 
into  a  cell  that  is  situated  in  any  position  in  the  array. 


These  data  could  consist  of  logical  commands  that  would 
alter  the  function  cf  the  processor  for  the  selected  cell. 
These  commands,  in  a  particular  application,  could  "kill” 
the  cell,  or  degrade  its  performance  according  to  a  desired 
need.  Another  approach,  that  would  reduce  the  data  entry 
volume,  would  be  to  design  a  subroutine  to  implement  a 
defective  processor  and  assign  that  processor  to  the  target 
cell.  The  color  feature  can  also  be  used  to  analyse  the 
effect  that  faults  at  a  particular  cell  will  have  over  the 
others.  If  a  color  is  injected  at  selected  input  ports 
(depending  on  the  software  of  the  subroutine  that  supports  a 
processor) ,  it  will  spread  over  the  cells  that  receive  data 
from  this  faulty  cell  under  evaluation,  and  will  display  the 
"bad  sector"  on  the  screen. 

Many  other  possible  studies  related  to  systolic  arrays 
may  be  considered  due  to  the  flexibility  of  the  SYSGBAS 


7.  THE  INTERACTION  WITH  THE  SIMOLATOB 

A.  SCFT1ABE  BEQQIBEHEBTS  AHD  PROCEDURES 

In  order  to  operate  with  SYSGRAS,  the  user  must  have  the 
following  files. 

SYSTOY.PAS 
SYSFCE. FCB 
10GIII.COM 
JOKK.CCH 

The  user,  after  compiling  SYSTOY.PAS  and  SYSFOR.FCR, 
must  link  both  relocatable  codes  with  SIGGRAPH  core,  which 
is  at  CS3202.COR2  library.  To  do  this,  the  command  to  be 
used  is 

LINK  SYSTOY ,S YSFOR, (CS3202. CORE) CORE/LIB 

In  order  to  run,  enter  the  command  ’'RON  SYSTCY" .  The  code 
will  address  the  EAMTEK  peripheral  system  when  running,  and 
this  device  should  be  turned  on  and  ready  to  operate. 

B.  ENTERING  A  HEN  PROBLEM 

Nhen  the  command  "RON  SYSTCY"  is  entered,  the  simulator 
starts  prompting  all  questions  that  must  be  answered  by  the 
user  to  enter  the  problem.  In  Appendix  D  we  have  recorded  a 
session  of  the  Simulator  to  make  the  interaction  more  under¬ 
standable.  It  was  cur  original  intention  to  present  this 
recording  of  the  session  as  a  guide  to  SYSGRAS  for  the 
Matrix  Triangularization  problem. However ,  due  to  the  large 
amount  of  material  that  needs  to  be  presented,  we  decided  to 
include  a  short  "dummy"  session  in  Appendix  D  for  clarity. 
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It  shovs  hew  to  enter  the  data  and  how  to  handle  an  incor¬ 
rect  input.  The  simulator  permits  recovery  from  entry  errors 
and  permits  display  of  input  data  as  many  times  as  desired. 
The  user  has  the  following  choices  when  dealing  with  the 
simulator: 

select  a  new  problem  run  or  a  previous  problem  session 

-  record  the  session  in  a  separate  file  for  later 
printing 

-  ability  to  interrupt  the  graphics  presentation  at  the 
end  of  each  clock  cycle 

It  is  recommended  that  the  user  follow  exac tl v  the  same 
nomenclature  that  was  mentioned  in  Chapter  III  Section  E  to 
avoid  errors  that  might  be  difficult  to  troubleshoot. 

C.  REVIEW  OP  A  PREVIOUS  SESSION 

We  have  included  in  Appendix  E  a  recording  of  a  session 
with  SYSGRAS  for  the  purpose  of  reviewing  a  previous 
session.  The  user  must  take  care  to  rename  the  file  which 
contains  the  data  from  previous  session  into  a  file  under 
the  name  SYSARRAY.  MEM  .  This  file  must  be  the  most  recent 
version  generated  by  the  VAX  VMS  Operating  System. 

The  following  files  with  data  from  previous  sessions  are 
already  available,  and  may  be  run  upon  request  freir  the 
user : 

-  file  DIVDIFF. HER,  with  recording  of  simulation  anal¬ 
ysed  in  Chapter  IV,  Section  A. 

-  file  MULTPLY.MEH,  with  recording  of  simulation  anal¬ 
ysed  in  Chapter  IV,  Section  B. 

file  GIVENSW. HEM,  with  recording  of  simulation  anal¬ 
ysed  in  Chapter  IV,  Section  C. 
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-  file  BAKSUBS. MEN,  with  recording  of  simulation  anal¬ 
ysed  in  Chapter  IV,  Section  D. 

0.  BECOBDIHG  OF  COHPOTED  DATA  AND  PRINTING  OOT 

If  the  user  has  asked  for  this  facility,  SISGEAS  will 
write  all  computed  results  of  each  cell  at  each  clock  cycle 
into  a  file  named  SISPEINT.DOC,  which  can  be  printed  out 
after  the  end  of  the  session. 
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A.  STRUCTURE  OF  THE  SIMULATOR 

The  SISGBAS  has  been  designed  in  tvo  layers.  The  upper 
layer  that  interacts  with  the  user  is  in  PASCAL ,  and  so  it 
is  guite  portable.  In  fact,  due  to  the  use  of  the 
,,OTHEBilISI,,  feature  offered  by  the  "CASE"  command  in  the  VAX 
PASCAL  COMPILES,  a  feu  of  its  case  structures  have  to  be 
slightly  modified  before  it  can  be  installed  in  another 
machine.  The  lover  layer  is  constructed  in  FORTRAN  77  and 
it  is  compiled  with  VAX  FOB  THAN.  This  layer  interfaces  with 
the  Graphics  Package  SIGGRAPB  that  is  presently  available  at 
NPS  on  the  VAX  750  machine.  It  has  been  set  up  for  presenta¬ 
tion  cn  the  RAHTEK  9400  system,  and  so  it  is  not  portable. 
Its  structure,  however,  can  be  modified  to  interface  with 
another  graphics  package  with  characteristics  similar  to 
those  of  SIGGRAPH. 

The  PASCAL  layer  is  presented  in  Appendix  A  and  the 
FORIRAN  layer  is  in  Appendix  B. 

B.  MODULAR  DESIG1  ASPECTS 

The  design  of  STSGRAS  has  followed  the  philosophy  of 
modular  design.  The  data  structure  has  been  designed  to 
match  the  abstract  idea  of  a  systolic  array  and  its  multiple 
features,  subsequently  it  can  be  easily  identifiable  by  the 
program  user.  The  basic  element  is  the  record  "NODE".  This 
element  keeps  all  the  information  regarding  each  cell.  The 
whole  array  is  a  collection  of  N0D£S*s,  and  these  are  assem¬ 
bled  into  a  higher  level  hierarchy  by  the  record  "GRAPH" 
which  contains  all  information  about  the  whole  array  at  each 
clock  cycle. 


The  part  of  the  program  that  is  aost  relevant  tc  the 
user  refers  to  the  cell  processor  support  routines.  If  new 
types  of  cells  need  tc  be  simulated,  new  support  subroutines 
aust  be  added  to  the  existing  set.  This  can  be  done  without 
coaplete  knowledge  of  the  prograa  iapleaentation  because  of 
the  acdularity  of  the  design.  How  we  will  refer  specifi¬ 
cally  to  this  kind  of  software  aaintenance. 

C.  DESIGIIIG  AID  INSTALLING  A  HBB  PROCESSOR  SOPPOST 

SOBRCUTIIE 

The  cell  processes  support  subroutines  aodify  the  global 
data  structure  represented  by  the  variable  G  of  type  GRAPH. 
The  existing  set  of  such  subroutines  are  under  the  consent 
titled  "library  routines  for  cell  processors"  in  Appendix  A. 
If  a  new  one  needs  to  be  created  and  added,  it  should 
confozn  with  the  pattern  seen  in  examples.  The  subroutine 
iapleaentation  aust  be  placed  between  the  stateaent  "with 
G. NODES  (.1.)  do  begin"  and  the  statement 

"COLC5_£f OCESSING  (G,  I)  "•  This  last  command  calls  a  subrou¬ 
tine  that  will  compute  the  color  of  the  cell  as  a  result  of 
the  combination  of  the  different  primary  colors  of  the  input 
data. 

The  following  steps  must,  therefore,  be  followed  to 
introduce  a  new  cell  subroutine: 

i.  increase  the  constant  NUMBER_OF_ROOTINES. 

ii.  modify  the  enumerated  type  PROCESSOR_TYPE  to  include 
another  routine  reference  name  (e.g.  E0UTINE_9)  . 

iii.  modify  procedure  DEFINE_ NODES  to  include  references 
to  the  new  subroutine.  This  should  be  done  in  such  a 
way  that  the  new  routine  is  referred  the  same  way  as 
the  others. 

iv.  aodify  procedure  OUTPUT  ARRAY  the  same  way  as  above. 


v.  aodify  procedure  COBBECT_HOD E  as  above. 

vi.  aodify  the  aain  prograa  statement  "case  PBOCESSOR 
cf"  as  above. 

vii.  place  the  body  of  the  new  procedure  next  to  the 
existing  procedures  to  confora  with  the  prcgran 
structure. 

Bints  for  the  design  of  the  nev  procedure: 

the  type  NODE  defines  the  cell  data  structure. 

the  nuaber  of  input/output  ports  in  a  cell  can  be 
aodified  by  changing  the  constant  CONNECTIONS  in  the 
aain  prograa.  If  this  is  done,  the  constant  HAX1INKS 
also  nust  be  aodified  as  instructed  in  the  prcgran 
text  coaaents. 

The  present  iapleaentation  of  SISGBAS  restricts  the 
naxinui  nuaber  of  cells  in  an  array  to  23,  in  order  to 
achieve  a  faster  coaputational  speed.  If  desired,  that  can 
be  modified  by  altering  the  aain  program  constant  MAXNCDES. 
If  this  is  done,  the  constant  NAXLINKS  aust  also  be  aodified 
as  instructed  in  the  prograa  text  coaaents. 


1.  EST1B1ISHIIG  C0HP1BIS0H  P1B1HEIBBS 


The  simulations  that  have  been  studied  in  Chapter  V 
provided  us  a  reasonable  tool  to  use  for  evaluation  of  algo* 
rithms  iiplemented  in  systolic  arrays.  Important  factors 
that  must  be  considered  in  this  kind  of  an  evaluation  are: 

i.  use  of  the  same  type  of  processors  in  other  algo¬ 
rithms 

ii.  average  percentage  of  cells  involved  in  effective 
computation  per  clock  cycle 

iii.  degree  of  homogeneity  of  cells 

iv.  degree  of  the  usage  of  pipelining 

v.  possibility  of  modular  expandability 

vi.  required  number  of  external  connections  to  each  cell 

vii.  cell  complexity 

Unfortunately  not  all  these  factors  can  be  determined  by 
using  this  tool.  Subjectiveness  also  has  to  play  a  role  in 
this  evaluation. 

B.  CCNC1USI0RS  ABOUT  A1G0HITHHS  UNDER  ANALYSIS 

The  iiplementaticn  of  an  algorithm  in  a  systolic  array 
may  provide  a  greater  calculation  speed,  but  a  cost  vas  paid 
to  get  that  achievement. The  cost  can  be  evaluated  in  terms 
of  the  complexity  of  the  design,  the  difficulty  in  iiplemen- 
tation,  the  manufacture  problems  that  can  result  frcm  a 
particular  array  configuration,  etc.  Ke  want  to  make  sure 


that  this  cost  is  not  excessive  and,  ideally,  to  be  minimum. 
These  axe  considerations  that  require  criteria  of  evaluation 
to  be  established.  Instead  of  trying  to  cone  up  with  a 
criteria  that  could  be  the  subject  of  lengthy  discussion, 
the  above  aentioned  factors  are  used  to  discuss  the  effi¬ 
ciency  of  the  algoritbas  with  a  nore  qualitative  than  quan¬ 
titative  approach. 

The  first  factor  ve  want  to  establish  is  that  concerning 
the  average  percentage  of  cells  effectively  involved  in 
active  coaputation  per  clock  cycle  in  each  algorithm.  For 
each  algcritha  we  consider  the  number  of  required  cycles  to 
run  till  completion,  the  number  of  physical  cells  in  the 
systolic  array,  and  the  number  of  cells  involved  in  active 
coaputation  in  each  clock  cycle.  This  last  factor  corre¬ 
sponds  to  the  number  of  cells  shown  on  the  screen  with  a 
color  other  than  black.  Thus  we  do  have: 

Matrix  Triangular ization : 
number  of  cycles  =  9 
total  number  of  cells  =  9 
percentage  of  computation  /  cycle  = 

*  1/9  x  1/9  X  (1+2+4+5+6+5+3+1+0)  » 

=  0.333 

Performing  similar  calculations  for  the  other  algo¬ 
rithms,  ve  obtain  the  numbers  shown  up  in  Table  I. 
Eefinitely,  the  Matrix  Triangularization  has  a  longer  "duty 
cycle"  per  cell  and  this  means  less  waste. 

The  number  of  £ypgs  o£  cell§  required  for  each  algorithm 
is  easily  seen  in  the  pictures  shown  in  Chapter  IV.  Modular 
expandability  is  another  important  feature.  It  implies  the 
possibility  of  interconnecting  chips  (each  one  embedding  a 
whole  systolic  array)  in  a  cascade  fashion  or  seme  ether 
arrangement  where  the  chips  are  used  as  smaller  parts  of  a 
higher  hierarchical  arrangement.  Matrix  Triangularization 


and  Back  Substitution  have  restrictions  in  that  respect  due 
to  the  fact  that  their  cell  arrangeaent  is  not  syaaetxical 
and  such  an  ezpansicn  vould  require  different  types  of 
chips. 

Uith  respect  to  these  factors,  we  can  suaaarize  in  the 
following  table: 


TABLE  111 

Comparison 

of  Algoritha  lapleaentation  Evaluation 
Factors 

1  MATRIX  I  BACK  j  DIV.  1 

MATRIX 

FUTURE 

j  TRIANG.  j  SUBST.j  DIFF.  j 

MUIT. 

computing 
per  cell 


Number  of 
cell  types 

Hodular 

expandability 


0.333 


*  restricted 

*  yes 


The  number  of  external  connections  required  ty  each  cell 
is  pehaps  the  aost  demanding  factor  existing  in  pratical 
implementations.  The  normal  connections  are  power  and  clock 
lines.  Some  algorithxs  like  Matrix  Multiplication  need  only 
these  two.  Some  others  have  synchronization  and  data  feeding 
problems  that  can  only  be  solved  with  the  addition  of  extra 
external  connections  to  the  cell.  This  feature  is  somehow 
related  to  the  degree  of  pipelining  that  can  be  achieved. 


The  fiatrix  Triangularization  and  the  Divided  Differences 
algorithms  can  be  pratically  ixplemented  if  there  is  a  flag 
signal  that  can  trigger  a  pumping  out  mechanism  to  send  the 
data  frozen  in  each  cell  after  the  last  clock  cycle.  To 
recover  the  data  and  display  it  externally,  additional 
connections  are  needed  in  each  cell.  This  increases  the 
total  cost  of  the  chip  and  complicates  the  problem  of 
modular  expandability.  Since  a  design  goal  in  every  VLSI 
implementation  is  to  reduce  the  number  of  external  connec- 
tions,  this  is  a  key  factor  in  the  design  effort. 
Pipelining  possibility  is  also  affected  because  the  pumping 
out  operation  is  not  part  of  the  normal  algorithm  cycle.  It 
is  an  additional  burden  that  may  reguire  the  interruption  of 
the  data  pumping  into  and  the  pipelining  will  have  to  be 
restricted  to  data  of  the  same  problem.  Data  from  different 
problems  can  not  coexist  at  the  array  at  the  same  time.  A 
calculation  of  a  problem  has  to  finish  in  order  to  start 
another.  In  this  sense,  if  ve  examine  the  Matrix 
Multiplication  algorithm,  ve  will  conclude  that  true  pipe¬ 
lining  can  be  achieved.  No  data  flow  interruption  occurs 
because  the  results  are  not  kept  frozen  in  the  cells,  tut 
are  moved  as  part  of  the  data  flow. 

„The  factor  of  ge^l  complexity  is  important  not  only 
because  of  the  possible  high  cost  of  the  hardware,  but  also 
because  cf  the  time  that  may  be  required  for  a  clock  cycle 
to  be  executed.  If  a  longer  time  is  required,  the  clock 
speed  has  to  be  kept  lov  and  the  efficiency  of  the 
processing  decreases.  Simple  operations  are  always  a  goal  on 
the  algorithm  design  because  they  result  simpler  and  faster 
hardware.  This  is  why  another  algorithm  for  Matrix 
Triangularization  without  Square  Boots  has  been  suggested. 


C.  SUGG1STI0HS  FOB  FTTFUBE  HODIFICATIONS  OB  SYSGBAS 

He  are  conscious  that  we  have  not  achieved  an  optical 
design  with  our  simulator  in  terns  of  perforaance.  It  could 
be  inproved  if  the  graphics  iapleaentation  vere  aore  effi¬ 
cient.  This  would  speed  up  the  interaction  with  the  user  and 
aake  it  aore  attractive.  Another  point  that  cculd  be 
inproved  is  the  user  interface.  The  anount  of  inforaation 
that  is  required  from  the  user  requires  an  effort  that  eight 
be  niniaized  if  other  interface  techniques  were  used,  such 
as  conbined  use  of  mouse  or  lightpen  with  the  keyboard 
input. 

Additional  points  that  could  be  modified  to  enhance  the 
presentaticn  at  the  screen  are: 

Elimination  of  non  significant  zeros  from  the  cell 
display  (such  as  turning  000.400  into  0.4) . 

Change  of  the  bidirectional  arrow  that  represents 
ccamunications  in  both  directions  between  cells  into 
unidirected  arrows,  one  for  each  link  between  cell 
ports.  This  wculd  aake  the  cells  appear  on  the  screen 
the  same  way  they  are  represented  in  the  literature. 
An  example  of  the  birectional  arrow  can  be  seen  in 
Fig.  4.36,  in  Chapter  IV,  connecting  cells  05  and  04, 
as  well  as  04  ard  02. 

0.  THE  BOLE  OF  TBE  S3BOLATOR 

t  Our  goal  was  to  contribute  to  the  understanding  cf  the 

implementation  of  algorithms  in  systolic  arrays.  This  led  us 
to  the  realization  of  soae  of  the  inherent  problems  that 
appear  in  this  field.  The  complexity  of  the  data  interac¬ 

tion  in  the  space-time  frame  is  considered  to  be  the 
greatest  obstacle  in  the  understanding  process.  To  decide 
on  the  approach  to  adopt  to  handle  the  problem  is  also 
difficult.  In  response  to  that,  we  designed  a  tool  whose 
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task  is  to  provide  a  software  implementation  of  the  systolic 
array.  Ihe  user  is  left  free  to  concentrate  on  the  algo¬ 
rithm.  lie  have  shown  the  use  of  this  tool  on  the  study  of 
some  algorithms.  This  gave  us  the  opportunity  to  realize  the 
power  of  a  systolic  array  in  the  process  of  performing 
calculations.  Aspects  such  as  computation  time,  memory 
requirements  and  I/O  interactions  are  minimized.  The  mapping 
of  an  algorithm  onto  a  systolic  array  certainly  represents  a 
major  difficulty.  Tie  main  role  of  our  simulator  is  not  to 
help  in  this  mapping,  but  it  can  be  used  in  the  validation 
phase  of  the  algorithm  design.  The  greatest  contribution 
that  we  see  in  SXSGEAS  is  its  versatility.  The  user  can 
model  an  ideal  environment  for  the  algorithm  or  an  environ¬ 
ment  contaminated  by  problems  in  the  underlying  hardware. 
This  certainly  can  lelp  in  achieving  better  results  while 
presenting  a  clear  idea  of  the  robustness  of  an  algorithm. 
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Finally  we  came  up  with  the  system  represented  ty  a  matri 
equation  that  can  te  easily  solved  by  Back  Substitution 
RX  =  CB.  Approximating  elements  r(2,1),  r (3, 1)  and  r(3,2)  to  zer 
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